Patent abstract:
The present disclosure provides a cross-modality person re-identification method based on dual-attribute information, which extracts rich semantic information by making full use of data of two modalities, and provides a space construction and attribute fusion algorithm based on text and image attributes. An end-to-end cross-modality person re-identification network based on hidden space and attribute space is constructed to improve the semantic expressiveness of the features extracted by the model. To resolve the problem of cross-modality person re-identification based on an image and text, a new end-to-end cross-modality person identification network based on the hidden space and the attribute space is proposed, greatly improving the semantic expressiveness of the extracted features and making full use of the attribute information of a person.
Publication No.: NL2028092A
Application No.: NL2028092
Filing date: 2021-04-29
Publication date: 2021-07-28
Inventors: Gao Zan; Chen Lin; Wang Yinglong; Song Xuemeng; Nie Liqiang
Applicant: Shandong Artificial Intelligence Inst
IPC main class:
Patent description:

CROSS-MODALITY PERSON RE-IDENTIFICATION METHOD BASED ON DUAL-ATTRIBUTE INFORMATION
TECHNICAL FIELD The present disclosure relates to the fields of computer vision and deep learning, and specifically, to a cross-modality person re-identification method based on dual-attribute information.
BACKGROUND In the information age, video surveillance plays an invaluable role in maintaining public safety. Person re-identification is a crucial subtask in a video surveillance scenario, and is intended to find photos of a same person from image data generated by different surveillance cameras. Public safety monitoring facilities are increasingly widely applied, resulting in massive image data collection. How to quickly and accurately find a target person in the massive image data is a research hotspot in the field of computer vision. However, in some specific emergency scenarios, an image matching a to-be-found person cannot be provided in time as a basis for retrieval, and only an oral description can be provided. Therefore, cross-modality person re-identification based on a text description emerges. Cross-modality person re-identification is to find, in an image library based on a natural language description of a person, the image most conforming to the text description information. With the development of deep learning technologies and their superior performance in different tasks, researchers have proposed some deep learning-related cross-modality person re-identification algorithms. These algorithms can be roughly classified into: 1) a semantic intimacy value calculation method, which is used to calculate an intimacy value of a semantic association between an image and text, to improve intimacy between an image and text that belong to a same class, and reduce intimacy between an image and text that belong to different classes; and 2) a subspace method, which is intended to establish shared feature expression space for images and text, and uses a metric learning strategy in the shared feature expression space to decrease a distance between image and text features belonging to a same person identity (ID) and to increase a distance between image and text features belonging to different person IDs. However, the semantic expressiveness of features extracted by using these methods still needs to be improved. These methods ignore or do not fully consider the effectiveness of using attribute information of persons to express semantic concepts.
SUMMARY To overcome the disadvantages of the above technology, the present disclosure provides a cross-modality person re-identification method by using a space construction and attribute fusion algorithm based on text and image attributes. The technical solution used in the present disclosure to resolve the technical problem thereof is as follows: A cross-modality person re-identification method based on dual-attribute information includes the following steps:
a) extracting a text description feature $T$ and an image feature $I$ of a person from content obtained by a surveillance camera;
b) extracting a text attribute feature $C_T$ from the extracted text description of the person, and extracting an image attribute feature $C_I$ from the extracted image;
c) inputting the text description feature and the image feature of the person in the step a) to shared subspace, calculating a triplet loss function of a hard sample, and calculating a classification loss of a feature in the shared subspace by using a Softmax loss function;
d) fusing the text description feature $T$ and the image feature $I$ of the person with the text attribute feature $C_T$ and the image attribute feature $C_I$;
e) constructing feature attribute space based on attribute information; and
f) retrieving and matching the extracted image feature and text description feature of the person.
Further, the extracting a text description feature of a person in the step a) includes the following steps:
a-1.1) segmenting words in a description statement of the content obtained by the surveillance camera, and establishing a word frequency table;
a-1.2) filtering out a low-frequency word in the word frequency table;
a-1.3) performing one-hot encoding to encode a word in the word frequency table; and
a-1.4) performing feature extraction on the text description of the person by using a bidirectional long short-term memory (LSTM) model.
Further, the extracting an image feature in the step a) includes the following steps:
a-2.1) performing feature extraction on the image by using a ResNet that has been pre-trained on an ImageNet data set; and
a-2.2) performing semantic segmentation on the extracted image, and performing, by using the ResNet in the step a-2.1), feature extraction on an image obtained after semantic segmentation.
Further, the step b) includes the following steps:
b-1) preprocessing data of the text description of the person by using a natural language toolkit (NLTK) tool library, and extracting a noun phrase constituted by an adjective plus a noun and a noun phrase constituted by a plurality of superposed nouns;
b-2) sorting the extracted noun phrases based on a word frequency, discarding a low-frequency phrase, and constructing an attribute table by using the first 400 noun phrases, to obtain the text attribute feature $C_T$; and
b-3) training the image by using a PA-100K data set, to obtain 26 prediction values, and marking an image attribute with a prediction value greater than 0 as 1 and an image attribute with a prediction value less than 0 as 0, to obtain the image attribute feature $C_I$.
Further, the step c) includes the following steps:
c-1) calculating a triplet loss $L_{trip}(I,T)$ of the hard sample according to a formula
$$L_{trip}(I,T)=\sum_{I_k\in I}\max\big(\rho_1+S(I_k,T_k^{n})-S(I_k,T_k^{p}),\,0\big)+\sum_{T_k\in T}\max\big(\rho_1+S(T_k,I_k^{n})-S(T_k,I_k^{p}),\,0\big),$$
where $I_k$ represents a feature of the $k$-th image, $I_k$ is used as an anchor, $T_k^{n}$ represents a feature, closest to the anchor $I_k$, of a heterogeneous text sample, $T_k^{p}$ represents a feature, farthest from the anchor $I_k$, of a congeneric text sample, $T_k$ represents a feature of the $k$-th text description of the person, $T_k$ is used as an anchor, $I_k^{n}$ represents a feature, closest to the anchor $T_k$, of the heterogeneous image sample, $I_k^{p}$ represents a feature, farthest from the anchor $T_k$, of the congeneric image sample, $\rho_1$ represents a boundary (margin) of the triplet loss, and $S(\cdot,\cdot)$ represents cosine similarity calculation;
c-2) calculating a cosine similarity between $I_k$ and $T_k$ according to a formula
$$S(I_k,T_k)=\frac{I_k^{\top}T_k}{\lVert I_k\rVert\,\lVert T_k\rVert},$$
where $I_k$ represents a feature of the $k$-th image in the shared subspace, and $T_k$ represents a feature of the $k$-th text description of the person in the shared subspace;
c-3) calculating a classification loss $L_{cls}(I_k)$ of the image feature $I_k$ in the shared subspace according to a formula
$$L_{cls}(I_k)=-\log\!\left(\frac{\exp\big(I_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(I_k^{\top}W_j+b_j\big)}\right),$$
where $I_k^{\top}$ represents a transposed image feature in the shared subspace, $W$ represents a classifier, $W\in\mathbb{R}^{d_l\times C}$, $d_l$ represents a feature dimension of the shared subspace, $C$ represents a quantity of ID information classes of the person, $y_k$ represents ID information of $I_k$, $b$ represents a bias vector, $W_j$ represents a classification vector of the $j$-th class, $b_j$ represents a bias value of the $j$-th class, $W_{y_k}$ represents a corresponding classification vector of the $y_k$-th class, and $b_{y_k}$ represents a bias value of the $y_k$-th class; and calculating a classification loss $L_{cls}(T_k)$ of the text description feature $T_k$ of the person in the shared subspace according to a formula
$$L_{cls}(T_k)=-\log\!\left(\frac{\exp\big(T_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(T_k^{\top}W_j+b_j\big)}\right),$$
where $T_k^{\top}$ represents a transposed text feature in the shared subspace; and
c-4) calculating a loss function $L_{latent}(I,T)$ of the shared subspace according to a formula
$$L_{latent}(I,T)=\frac{1}{n}L_{trip}(I,T)+\frac{1}{n}\sum_{k}\big(L_{cls}(I_k)+L_{cls}(T_k)\big),$$
where $n$ represents a quantity of samples in one batch.
Further, the step d) includes the following steps:
d-1) calculating a loss function $L_{coral}(I,T)$ according to a formula
$$L_{coral}(I,T)=\frac{1}{4d^{2}}\big\lVert C_I-C_T\big\rVert_F^{2},$$
where the image feature $I$ is constituted by the features $I_k$, the text description feature $T$ of the person is constituted by the features $T_k$, $C_I$ and $C_T$ represent the covariance matrices of the image features and the text features, $d$ represents the dimension of $I_k$ and $T_k$, and $\lVert\cdot\rVert_F$ represents a Frobenius norm;
d-2) calculating, according to a formula
$$t=\mathrm{sigmoid}(C\times U_c+F\times U_f),$$
weights of the attribute feature and the image or text feature during feature fusion, where $C$ represents a to-be-fused attribute feature, $F$ represents a to-be-fused image or text feature, $U_c$ and $U_f$ are projection matrices, $t$ represents a weight, during feature fusion, obtained by adding up the projection results and processing the obtained result by using a sigmoid function, $U_c\in\mathbb{R}^{s\times d_a}$, $U_f\in\mathbb{R}^{d_a\times d_a}$, $s$ represents a quantity of image attribute classes or text attribute classes, and $d_a$ represents a feature dimension of the attribute space; and
d-3) calculating a fused feature $A$ according to a formula
$$A=t\times\big[C\times W_c\big]+(1-t)\times\big[F\times W_f\big],$$
where $W_c\in\mathbb{R}^{s\times d_a}$ and $W_f\in\mathbb{R}^{d_a\times d_a}$ represent projection matrices.
Further, the step e) includes the following steps:
e-1) calculating a triplet loss $L_{atrip}(I,T)$ of the attribute space according to a formula
$$L_{atrip}(I,T)=\sum_{I_k^{s}\in I^{s}}\max\big(\rho_2+S_a(I_k^{s},T_k^{sn})-S_a(I_k^{s},T_k^{sp}),\,0\big)+\sum_{T_k^{s}\in T^{s}}\max\big(\rho_2+S_a(T_k^{s},I_k^{sn})-S_a(T_k^{s},I_k^{sp}),\,0\big),$$
where $\rho_2$ represents a boundary (margin) of the triplet loss, $S_a(\cdot,\cdot)$ represents cosine similarity calculation, $I_k^{s}$ represents a feature of the $k$-th image in the attribute space, $I_k^{s}$ is used as an anchor, $T_k^{sn}$ represents a feature, closest to the anchor $I_k^{s}$, of the heterogeneous text sample, $T_k^{sp}$ represents a feature, farthest from the anchor $I_k^{s}$, of the congeneric text sample, $T_k^{s}$ represents a feature of the $k$-th text description of the person in the attribute space, $T_k^{s}$ is used as an anchor, $I_k^{sn}$ represents a feature, closest to the anchor $T_k^{s}$, of the heterogeneous image sample, and $I_k^{sp}$ represents a feature, farthest from the anchor $T_k^{s}$, of the congeneric image sample;
e-2) calculating a cosine similarity between $a_{I_k}$ and $a_{T_k}$ according to a formula
$$S_a(I_k,T_k)=\frac{a_{I_k}^{\top}a_{T_k}}{\lVert a_{I_k}\rVert\,\lVert a_{T_k}\rVert},$$
where $a_{I_k}$ and $a_{T_k}$ respectively represent an image feature with semantic information and a text feature with semantic information that are obtained after attribute information fusion in the attribute space; and
e-3) calculating a loss function $L_{attr}(I,T)$ of the attribute space according to a formula
$$L_{attr}(I,T)=\frac{L_{atrip}(I,T)+L_{coral}(I,T)}{n}.$$
Further, the step f) includes the following steps:
f-1) calculating a loss function $L(I,T)$ of a dual-attribute network according to a formula
$$L(I,T)=L_{latent}(I,T)+L_{attr}(I,T);$$
f-2) calculating a similarity $A(I_k,T_k)$ between dual attributes according to a formula
$$A(I_k,T_k)=A_l(I_k,T_k)+A_a(a_{I_k},a_{T_k}),$$
where $A_l$ represents a calculated similarity between the features $I_k$ and $T_k$ learned from the shared subspace, and $A_a$ represents a calculated similarity between the features $a_{I_k}$ and $a_{T_k}$ learned from the attribute space; and
f-3) calculating cross-modality matching accuracy based on the similarity $A(I_k,T_k)$.
The present disclosure has the following beneficial effects: The cross-modality person re-identification method based on dual-attribute information extracts rich semantic information by making full use of data of two modalities.
A space construction and attribute fusion algorithm based on text and image attributes is provided.
An end-to-end cross-modality person re-identification network based on the hidden space and the attribute space is constructed to improve the semantic expressiveness of the feature extracted by using the model. To resolve the problem of cross-modality person re-identification based on an image and text, a new end-to-end cross-modality person identification network based on the hidden space and the attribute space is proposed, greatly improving the semantic expressiveness of the extracted feature and making full use of the attribute information of the person.
BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a flowchart of the present disclosure; FIG. 2 shows changes of the loss functions in a model training process according to the present disclosure; and FIG. 3 compares the method in the present disclosure with existing methods in terms of Top-k accuracy on the CUHK-PEDES data set.
DETAILED DESCRIPTION OF THE EMBODIMENTS The present disclosure is further described with reference to FIG. 1, FIG. 2, and FIG. 3. As shown in FIG. 1, a cross-modality person re-identification method based on dual-attribute information includes the following steps.
a) Extract a text description feature $T$ and an image feature $I$ of a person from content obtained by a surveillance camera. The present disclosure is intended to establish a semantic association between an image captured by the surveillance camera for a person in a real scenario and a corresponding text description of the person. Feature representations of data of the two modalities need to be extracted separately. The image feature is extracted by using a currently popular convolutional neural network, ResNet, and the text feature is extracted by using a bidirectional LSTM, so that text context information can be fully obtained.
b) Extract a text attribute feature $C_T$ from the extracted text description of the person, and extract an image attribute feature $C_I$ from the extracted image. To resolve the problem that the semantic expressiveness of a feature is poor because an existing method does not make full use of attribute information, the present disclosure is designed to use attribute information of the person as auxiliary information to improve the semantic expressiveness of image and text features. An image attribute of the person is extracted by using an existing stable person-specific image attribute extraction model. A text attribute of the person comes from statistical information in a data set, and a noun phrase with a relatively high word frequency in the data set is used as the text attribute of the person in the present disclosure.
c) Input the text description feature and the image feature of the person in the step a) to shared subspace, calculate a triplet loss function of a hard sample, and calculate a classification loss of a feature in the shared subspace by using a Softmax loss function. Projection to shared vector space is a frequently used method for resolving a cross-modality retrieval problem. In the shared vector space, an association between data of the two modalities can be established. The present disclosure projects the extracted image and text features to the shared vector subspace, and adopts metric learning to decrease the distance between image and text features with the same person information and increase the distance between image and text features belonging to different persons. The present disclosure uses a triplet loss of the hard sample to achieve the above purpose. That is, in a batch of data, the heterogeneous sample of the other modality that is closest to the anchor data, and the congeneric sample of the other modality that is farthest from the anchor data, need to be found.
d) Fuse the text description feature $T$ and the image feature $I$ of the person with the text attribute feature $C_T$ and the image attribute feature $C_I$. The existing method does not make full use of the auxiliary function of the attribute information or uses only attribute information of one modality, resulting in poor semantic expressiveness of the feature that can be extracted by using the model. To resolve this problem, the present disclosure uses the extracted dual-attribute information, namely, the image and text attributes. Considering that different attributes play different roles in image and text matching of the person, the present disclosure uses a weight mechanism to enable semantic information to play a more important role in feature fusion. The present disclosure uses a strategy of matrix projection to project the to-be-fused image and text features and attribute features to the same dimensional space, and then weights the two types of features to obtain image and text features fused with the semantic information. Before feature fusion, to avoid a large difference between the distributions of the features of the two modalities, the present disclosure uses the frequently used loss function coral to minimize the difference between the distributions of the data of the two modalities.
e) Construct feature attribute space based on the attribute information, which is referred to as attribute space in the present disclosure. The image and text features fused with the semantic information are also sent to shared subspace. In the present disclosure, the image and text features with the same person information are assumed by default to have the same semantic meaning. In the attribute space, the present disclosure still uses the triplet loss of the hard sample to establish a semantic association between the image and text features that are of the person and are of different modalities.
f) Retrieve and match the extracted image feature and text description feature of the person. The finally extracted image and text features in the present disclosure include features extracted from the hidden space and features extracted from the attribute space. When the extracted model features are retrieved and matched, a cosine distance is used to calculate the distance between two model features in feature space, to measure their similarity. To make the ID information of the person learned from the hidden space and the semantic information of the person learned from the attribute space complementary, the present disclosure adds up the similarity matrices of the two types of features before sorting.
To resolve the problem that the existing cross-modality person re-identification method cannot effectively use the attribute information of the person as auxiliary information to improve the semantic expressiveness of the image and text features, the present disclosure provides an efficient cross-modality person re-identification method based on dual-attribute information, to extract rich semantic information by making full use of data of two modalities, and provides a space construction and attribute fusion algorithm based on text and image attributes. An end-to-end cross-modality person re-identification network based on the hidden space and the attribute space is constructed to improve the semantic expressiveness of the feature extracted by using the model. To resolve the problem of cross-modality person re-identification based on an image and text, a new end-to-end cross-modality person identification network based on the hidden space and the attribute space is proposed, to greatly improve the semantic expressiveness of the extracted feature and make full use of the attribute information of the person.
Embodiment 1 The extracting a text description feature of a person in the step a) includes the following steps: a-1.1) Preprocess text information when performing feature extraction on the text of the person; in other words, segment words in a description statement of the content obtained by the surveillance camera, and establish a word frequency table.
a-1.2) Filter out a low-frequency word in the word frequency table.
a-1.3) Perform one-hot encoding to encode a word in the word frequency table.
a-1.4) Perform feature extraction on the text description of the person by using a bidirectional LSTM model. The bidirectional LSTM model can fully consider the context of each word, so that richer text features are learned.
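The following is a minimal PyTorch sketch of such a bidirectional LSTM text branch; it is illustrative only, and the vocabulary size, embedding size, hidden size, and mean-pooling readout are assumptions rather than values fixed by the present disclosure.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Bidirectional LSTM over word indices, as in step a-1.4 (illustrative sizes)."""
    def __init__(self, vocab_size=5000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):               # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)    # (batch, seq_len, embed_dim)
        outputs, _ = self.bilstm(embedded)      # (batch, seq_len, 2 * hidden_dim)
        return outputs.mean(dim=1)              # pooled text description feature T
```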
The extracting an image feature in the step a) includes the following steps: a-2.1) Perform feature extraction on the image by using a ResNet that has been pre-trained on an ImageNet data set.
a-2.2) Perform semantic segmentation on the extracted image, and perform, by using the ResNet in the step a-2.1), feature extraction on an image obtained after semantic segmentation.
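A hedged sketch of the image branch in steps a-2.1 and a-2.2: a ResNet-50 pre-trained on ImageNet with its classification head removed serves as the global feature extractor, and the same backbone can be applied to the segmented person regions. The choice of ResNet-50 and the input size are assumptions, not requirements of the present disclosure.

```python
import torch
import torchvision.models as models

resnet = models.resnet50(pretrained=True)   # ImageNet pre-trained backbone
resnet.fc = torch.nn.Identity()             # drop the 1000-way ImageNet classifier
resnet.eval()

with torch.no_grad():
    person_crops = torch.randn(8, 3, 224, 224)   # dummy batch of person images
    image_features = resnet(person_crops)        # (8, 2048) image features I
```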
Embodiment 2 Many efforts have been made for person-specific image attribute identification, and good effects have been achieved. The present disclosure uses a stable person-specific attribute identification model to extract an attribute contained in an image of a person and a possible attribute value in the data set. The step b) includes the following steps:
b-1) Preprocess data of the text description of the person by using an NLTK tool library, and extract a noun phrase constituted by an adjective plus a noun and a noun phrase constituted by a plurality of superposed nouns.
b-2) Sort the extracted noun phrases based on a word frequency, discard a low-frequency phrase, and construct an attribute table by using the first 400 noun phrases, to obtain the text attribute feature $C_T$.
b-3) Train the image by using a PA-100K data set, to obtain 26 prediction values, and mark an image attribute with a prediction value greater than 0 as 1 and an image attribute with a prediction value less than 0 as 0, to obtain the image attribute feature $C_I$.
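A rough sketch of steps b-1 to b-3, assuming NLTK part-of-speech tagging with a simple noun-phrase chunking rule and a plain threshold at 0 for the 26 PA-100K attribute scores; the chunking grammar, the toy corpus variable, and the placeholder scores are illustrative assumptions, not the exact pipeline of the present disclosure.

```python
from collections import Counter
import nltk
import numpy as np

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# b-1: chunk "adjective(s) + noun(s)" phrases; the grammar is an assumed approximation.
chunker = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")

def noun_phrases(description):
    tree = chunker.parse(nltk.pos_tag(nltk.word_tokenize(description)))
    for subtree in tree.subtrees(lambda t: t.label() == "NP"):
        yield " ".join(word for word, _ in subtree.leaves())

# b-2: rank phrases by frequency and keep the 400 most frequent as the attribute table.
corpus_descriptions = ["a young woman wearing a red jacket and blue jeans"]  # toy corpus
phrase_counts = Counter(p for text in corpus_descriptions for p in noun_phrases(text))
attribute_table = [phrase for phrase, _ in phrase_counts.most_common(400)]

# b-3: binarize the 26 attribute scores predicted by a PA-100K-trained model at 0.
scores_26 = np.random.randn(26)                       # placeholder prediction values
image_attribute = (scores_26 > 0).astype(np.float32)  # image attribute feature C_I
```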
Embodiment 3 The present disclosure uses a frequently used shared subspace method in the field of cross-modality person re-identification to establish an association between feature vectors of the two modalities. The hidden space is set to enable both the image feature and the text feature of the person to have separability of a person ID, and to enable the image and text features to have a basic semantic association.
Considering that, in cross-modality person-specific image and text retrieval, a same person ID corresponds to a plurality of images and a plurality of corresponding text descriptions, the present disclosure designs the loss function to decrease a distance between an image and a text description that belong to a same person ID, and increase a distance between an image and text that belong to different person IDs.
Specifically, data of one modality is used as an anchor.
Data that is of another modality and belongs to a type the same as that of the anchor is used as a positive sample, and data belonging to a type different from that of the anchor is used as a negative sample.
This not only realizes classification, but also establishes, to a certain extent, a correspondence between an image and a text description that have same semantics but are of different modalities.
In an experiment, an image and a text description of a same person have same semantic information by default.
The step c) includes the following steps:
c-1) Calculate the triplet loss $L_{trip}(I,T)$ of the hard sample according to a formula
$$L_{trip}(I,T)=\sum_{I_k\in I}\max\big(\rho_1+S(I_k,T_k^{n})-S(I_k,T_k^{p}),\,0\big)+\sum_{T_k\in T}\max\big(\rho_1+S(T_k,I_k^{n})-S(T_k,I_k^{p}),\,0\big),$$
where $I_k$ represents a feature of the $k$-th image, $I_k$ is used as an anchor, $T_k^{n}$ represents a feature, closest to the anchor $I_k$, of a heterogeneous text sample, $T_k^{p}$ represents a feature, farthest from the anchor $I_k$, of a congeneric text sample, $T_k$ represents a feature of the $k$-th text description of the person, $T_k$ is used as an anchor, $I_k^{n}$ represents a feature, closest to the anchor $T_k$, of the heterogeneous image sample, $I_k^{p}$ represents a feature, farthest from the anchor $T_k$, of the congeneric image sample, $\rho_1$ represents a boundary (margin) of the triplet loss, and $S(\cdot,\cdot)$ represents cosine similarity calculation.
c-2) Calculate a cosine similarity between $I_k$ and $T_k$ according to a formula
$$S(I_k,T_k)=\frac{I_k^{\top}T_k}{\lVert I_k\rVert\,\lVert T_k\rVert},$$
where $I_k$ represents a feature of the $k$-th image in the shared subspace, and $T_k$ represents a feature of the $k$-th text description of the person in the shared subspace.
c-3) Calculate a classification loss $L_{cls}(I_k)$ of the image feature $I_k$ in the shared subspace according to a formula
$$L_{cls}(I_k)=-\log\!\left(\frac{\exp\big(I_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(I_k^{\top}W_j+b_j\big)}\right),$$
where $I_k^{\top}$ represents a transposed image feature in the shared subspace, $W$ represents a classifier, $W\in\mathbb{R}^{d_l\times C}$, $d_l$ represents a feature dimension of the shared subspace, $C$ represents a quantity of ID information classes of the person, $y_k$ represents ID information of $I_k$, $b$ represents a bias vector, $W_j$ represents a classification vector of the $j$-th class, $b_j$ represents a bias value of the $j$-th class, $W_{y_k}$ represents a corresponding classification vector of the $y_k$-th class, and $b_{y_k}$ represents a bias value of the $y_k$-th class; and calculate a classification loss $L_{cls}(T_k)$ of the text description feature $T_k$ of the person in the shared subspace according to a formula
$$L_{cls}(T_k)=-\log\!\left(\frac{\exp\big(T_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(T_k^{\top}W_j+b_j\big)}\right),$$
where $T_k^{\top}$ represents a transposed text feature in the shared subspace.
c-4) Calculate a loss function $L_{latent}(I,T)$ of the shared subspace according to a formula
$$L_{latent}(I,T)=\frac{1}{n}L_{trip}(I,T)+\frac{1}{n}\sum_{k}\big(L_{cls}(I_k)+L_{cls}(T_k)\big),$$
where $n$ represents a quantity of samples in one batch.
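The shared-subspace objective of this embodiment can be sketched in PyTorch as follows; the margin value, the shared classifier, and the batch convention (the k-th image and the k-th text share a person ID) are assumptions used for illustration rather than the exact training code of the present disclosure.

```python
import torch
import torch.nn.functional as F

def hard_triplet_loss(img_feats, txt_feats, labels, margin=0.2):
    """Hard-sample triplet loss L_trip of c-1; img_feats, txt_feats: (n, d), labels: (n,)."""
    sim = F.normalize(img_feats) @ F.normalize(txt_feats).t()   # cosine S(I_i, T_j) of c-2
    same_id = labels.unsqueeze(1) == labels.unsqueeze(0)
    # Image anchors: closest heterogeneous text, farthest congeneric text.
    neg_i = sim.masked_fill(same_id, float("-inf")).max(dim=1).values
    pos_i = sim.masked_fill(~same_id, float("inf")).min(dim=1).values
    # Text anchors: the same hard mining on the transposed similarity matrix.
    neg_t = sim.t().masked_fill(same_id, float("-inf")).max(dim=1).values
    pos_t = sim.t().masked_fill(~same_id, float("inf")).min(dim=1).values
    return (torch.clamp(margin + neg_i - pos_i, min=0).sum()
            + torch.clamp(margin + neg_t - pos_t, min=0).sum())

def latent_loss(img_feats, txt_feats, labels, classifier):
    """L_latent of c-4: (1/n) L_trip + (1/n) sum_k (L_cls(I_k) + L_cls(T_k))."""
    n = img_feats.size(0)
    cls = (F.cross_entropy(classifier(img_feats), labels, reduction="sum")
           + F.cross_entropy(classifier(txt_feats), labels, reduction="sum"))
    return hard_triplet_loss(img_feats, txt_feats, labels) / n + cls / n
```

For example, with `classifier = torch.nn.Linear(d, C)` standing in for the Softmax classifier $W$ and bias $b$, the two cross-entropy terms reduce exactly to the c-3 classification losses over the $C$ person IDs.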
Embodiment 4 Before the image and text features are fused with the attribute features, to avoid an excessive difference between the distributions of the data of the two modalities, the present disclosure uses the coral function in transfer learning to decrease the distance between the data of the two modalities. Specifically, the step d) includes the following steps:
d-1) Calculate a loss function $L_{coral}(I,T)$ according to a formula
$$L_{coral}(I,T)=\frac{1}{4d^{2}}\big\lVert C_I-C_T\big\rVert_F^{2},$$
where the image feature $I$ is constituted by the features $I_k$, the text description feature $T$ of the person is constituted by the features $T_k$, $C_I$ and $C_T$ represent the covariance matrices of the image features and the text features, $d$ represents the dimension of $I_k$ and $T_k$, and $\lVert\cdot\rVert_F$ represents a Frobenius norm.
d-2) Calculate, according to a formula
$$t=\mathrm{sigmoid}(C\times U_c+F\times U_f),$$
the weights of the attribute feature and the image or text feature during feature fusion, where $C$ represents a to-be-fused attribute feature, $F$ represents a to-be-fused image or text feature, $U_c$ and $U_f$ are projection matrices, $t$ represents the weight, during feature fusion, obtained by adding up the projection results and processing the obtained result by using a sigmoid function, $U_c\in\mathbb{R}^{s\times d_a}$, $U_f\in\mathbb{R}^{d_a\times d_a}$, $s$ represents a quantity of image attribute classes or text attribute classes, and $d_a$ represents a feature dimension of the attribute space.
d-3) Calculate a fused feature $A$ according to a formula
$$A=t\times\big[C\times W_c\big]+(1-t)\times\big[F\times W_f\big],$$
where $W_c\in\mathbb{R}^{s\times d_a}$ and $W_f\in\mathbb{R}^{d_a\times d_a}$ represent projection matrices.
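Below is a hedged sketch of d-1 to d-3: a CORAL-style covariance alignment between the image and text features followed by the gated attribute fusion. The covariance computation and all layer dimensions are assumptions consistent with the standard CORAL loss; the present disclosure does not spell out these implementation details.

```python
import torch

def coral_loss(img_feats, txt_feats):
    """L_coral of d-1: ||C_I - C_T||_F^2 / (4 d^2) on (n, d) feature matrices."""
    d = img_feats.size(1)
    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)
        return x.t() @ x / (x.size(0) - 1)
    diff = covariance(img_feats) - covariance(txt_feats)
    return (diff ** 2).sum() / (4.0 * d * d)

class AttributeFusion(torch.nn.Module):
    """Gated fusion of d-2/d-3: A = t * (C W_c) + (1 - t) * (F W_f)."""
    def __init__(self, attr_dim, feat_dim, out_dim):
        super().__init__()
        self.U_c = torch.nn.Linear(attr_dim, out_dim, bias=False)
        self.U_f = torch.nn.Linear(feat_dim, out_dim, bias=False)
        self.W_c = torch.nn.Linear(attr_dim, out_dim, bias=False)
        self.W_f = torch.nn.Linear(feat_dim, out_dim, bias=False)

    def forward(self, attr, feat):
        t = torch.sigmoid(self.U_c(attr) + self.U_f(feat))    # fusion weight of d-2
        return t * self.W_c(attr) + (1 - t) * self.W_f(feat)  # fused feature A of d-3
```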
Embodiment 5 In the hidden space, the triplet loss is used to establish the association between the image feature and the text feature. In the attribute space, the triplet loss of the hard sample is likewise used to establish a semantic association between features of different modalities. Therefore, the step e) includes the following steps:
e-1) Calculate a triplet loss $L_{atrip}(I,T)$ of the attribute space according to a formula
$$L_{atrip}(I,T)=\sum_{I_k^{s}\in I^{s}}\max\big(\rho_2+S_a(I_k^{s},T_k^{sn})-S_a(I_k^{s},T_k^{sp}),\,0\big)+\sum_{T_k^{s}\in T^{s}}\max\big(\rho_2+S_a(T_k^{s},I_k^{sn})-S_a(T_k^{s},I_k^{sp}),\,0\big),$$
where $\rho_2$ represents a boundary (margin) of the triplet loss, $S_a(\cdot,\cdot)$ represents cosine similarity calculation, $I_k^{s}$ represents a feature of the $k$-th image in the attribute space, $I_k^{s}$ is used as an anchor, $T_k^{sn}$ represents a feature, closest to the anchor $I_k^{s}$, of the heterogeneous text sample, $T_k^{sp}$ represents a feature, farthest from the anchor $I_k^{s}$, of the congeneric text sample, $T_k^{s}$ represents a feature of the $k$-th text description of the person in the attribute space, $T_k^{s}$ is used as an anchor, $I_k^{sn}$ represents a feature, closest to the anchor $T_k^{s}$, of the heterogeneous image sample, and $I_k^{sp}$ represents a feature, farthest from the anchor $T_k^{s}$, of the congeneric image sample.
e-2) Calculate a cosine similarity between $a_{I_k}$ and $a_{T_k}$ according to a formula
$$S_a(I_k,T_k)=\frac{a_{I_k}^{\top}a_{T_k}}{\lVert a_{I_k}\rVert\,\lVert a_{T_k}\rVert},$$
where $a_{I_k}$ and $a_{T_k}$ respectively represent an image feature with semantic information and a text feature with semantic information that are obtained after attribute information fusion in the attribute space.
e-3) Calculate a loss function $L_{attr}(I,T)$ of the attribute space according to a formula
$$L_{attr}(I,T)=\frac{L_{atrip}(I,T)+L_{coral}(I,T)}{n}.$$
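A small sketch of how the attribute-space objective could reuse the earlier snippets, assuming `hard_triplet_loss` and `coral_loss` as defined after Embodiments 3 and 4. Whether the CORAL term is computed on the fused or the pre-fusion features is not fixed here; the sketch applies both terms to the fused features $a_I$, $a_T$, and the normalization follows the e-3 formula as reconstructed above.

```python
def attr_loss(fused_img, fused_txt, labels, margin=0.2):
    """L_attr of e-3 = (L_atrip + L_coral) / n, here on the fused attribute-space features."""
    n = fused_img.size(0)
    return (hard_triplet_loss(fused_img, fused_txt, labels, margin)
            + coral_loss(fused_img, fused_txt)) / n
```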
Embodiment 6 In a process of model learning, the hidden space and the attribute space are trained at the same time. The step f) includes the following steps:
f-1) Calculate a loss function $L(I,T)$ of the dual-attribute network according to a formula
$$L(I,T)=L_{latent}(I,T)+L_{attr}(I,T).$$
As shown in FIG. 2, the change curves of the three loss functions in the training process are roughly consistent, which demonstrates the applicability and rationality of the present disclosure.
f-2) To make the ID information of the person learned from the hidden space and the semantic information of the person learned from the attribute space complementary in the test process, calculate a similarity $A(I_k,T_k)$ between the dual attributes according to a formula
$$A(I_k,T_k)=A_l(I_k,T_k)+A_a(a_{I_k},a_{T_k}),$$
where $A_l$ represents the calculated similarity between the features $I_k$ and $T_k$ learned from the shared subspace, and $A_a$ represents the calculated similarity between the features $a_{I_k}$ and $a_{T_k}$ learned from the attribute space.
f-3) Calculate cross-modality matching accuracy based on the finally obtained similarity $A(I_k,T_k)$. As shown in FIG. 3, the performance of the method in the present disclosure is significantly improved compared with that of the five existing methods listed in the table.
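A possible test-time sketch of f-2 and f-3: the cosine-similarity matrices computed in the hidden space and in the attribute space are added, and Top-k matching accuracy is read off the ranked gallery. The pairing of queries and gallery through a shared label tensor is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def top_k_accuracy(txt_latent, img_latent, txt_attr, img_attr, labels, k=10):
    """A(I, T) = A_l + A_a; returns the fraction of text queries whose top-k images share the ID."""
    sim = (F.normalize(txt_latent) @ F.normalize(img_latent).t()
           + F.normalize(txt_attr) @ F.normalize(img_attr).t())   # (queries, gallery)
    ranked = sim.argsort(dim=1, descending=True)[:, :k]            # top-k gallery indices
    hits = (labels[ranked] == labels.unsqueeze(1)).any(dim=1)      # correct ID in the top-k?
    return hits.float().mean().item()
```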
The above embodiments are only used for describing the technical solutions of the present disclosure and are not intended to limit the present disclosure. Although the present disclosure is described in detail with reference to the embodiments, those of ordinary skill in the art should understand that various modifications or equivalent substitutions may be made to the technical solutions of the present disclosure without departing from the spirit and scope of the technical solutions of the present disclosure, and such modifications or equivalent substitutions should be encompassed within the scope of the claims of the present disclosure.
Claims:
Claims (8)
[1]
1. A cross-modality person re-identification method based on dual-attribute information, comprising the following steps:
a) extracting a text description feature $T$ and an image feature $I$ of a person from content obtained by a surveillance camera;
b) extracting a text attribute feature $C_T$ from the extracted text description of the person, and extracting an image attribute feature $C_I$ from the extracted image;
c) inputting the text description feature and the image feature of the person in the step a) to shared subspace, calculating a triplet loss function of a hard sample, and calculating, by using a Softmax loss function, a classification loss of a feature in the shared subspace;
d) fusing the text description feature $T$ and the image feature $I$ of the person with the text attribute feature $C_T$ and the image attribute feature $C_I$;
e) constructing feature attribute space based on attribute information; and
f) retrieving and matching the extracted image feature and the text description feature of the person.
[2]
2. The cross-modality person re-identification method based on dual-attribute information according to claim 1, wherein the extracting a text description feature of a person in the step a) comprises the following steps:
a-1.1) segmenting words in a description statement of the content obtained by the surveillance camera, and establishing a word frequency table;
a-1.2) filtering out a low-frequency word in the word frequency table;
a-1.3) performing one-hot encoding to encode a word in the word frequency table; and
a-1.4) performing feature extraction on the text description of the person by using a bidirectional long short-term memory (LSTM) model.
[3]
3. The cross-modality person re-identification method based on dual-attribute information according to claim 1, wherein the extracting an image feature in the step a) comprises the following steps:
a-2.1) performing, by using a ResNet that has been pre-trained on an ImageNet data set, feature extraction on the image; and
a-2.2) performing semantic segmentation on the extracted image, and performing, by using the ResNet in the step a-2.1), feature extraction on an image obtained after semantic segmentation.
[4]
4. The cross-modality person re-identification method based on dual-attribute information according to claim 1, wherein the step b) comprises the following steps:
b-1) preprocessing, by using a natural language toolkit (NLTK) tool library, data of the text description of the person, and extracting a noun phrase constituted by an adjective plus a noun and a noun phrase constituted by a plurality of superposed nouns;
b-2) sorting the extracted noun phrases based on a word frequency, discarding a low-frequency phrase, and constructing, by using the first 400 noun phrases, an attribute table to obtain the text attribute feature $C_T$; and
b-3) training the image by using a PA-100K data set to obtain 26 prediction values, and marking an image attribute with a prediction value greater than 0 as 1 and an image attribute with a prediction value less than 0 as 0, to obtain the image attribute feature $C_I$.
[5]
5. The cross-modality person re-identification method based on dual-attribute information according to claim 1, wherein the step c) comprises the following steps:
c-1) calculating a triplet loss $L_{trip}(I,T)$ of the hard sample according to a formula
$$L_{trip}(I,T)=\sum_{I_k\in I}\max\big(\rho_1+S(I_k,T_k^{n})-S(I_k,T_k^{p}),\,0\big)+\sum_{T_k\in T}\max\big(\rho_1+S(T_k,I_k^{n})-S(T_k,I_k^{p}),\,0\big),$$
wherein $I_k$ represents a feature of the $k$-th image, $I_k$ is used as an anchor, $T_k^{n}$ represents a feature, closest to the anchor $I_k$, of a heterogeneous text sample, $T_k^{p}$ represents a feature, farthest from the anchor $I_k$, of a congeneric text sample, $T_k$ represents a feature of the $k$-th text description of the person, $T_k$ is used as an anchor, $I_k^{n}$ represents a feature, closest to the anchor $T_k$, of the heterogeneous image sample, $I_k^{p}$ represents a feature, farthest from the anchor $T_k$, of the congeneric image sample, $\rho_1$ represents a boundary of the triplet loss, and $S(\cdot,\cdot)$ represents cosine similarity calculation;
c-2) calculating a cosine similarity between $I_k$ and $T_k$ according to a formula
$$S(I_k,T_k)=\frac{I_k^{\top}T_k}{\lVert I_k\rVert\,\lVert T_k\rVert},$$
wherein $I_k$ represents a feature of the $k$-th image in the shared subspace, and $T_k$ represents a feature of the $k$-th text description of the person in the shared subspace;
c-3) calculating a classification loss $L_{cls}(I_k)$ of the image feature $I_k$ in the shared subspace according to a formula
$$L_{cls}(I_k)=-\log\!\left(\frac{\exp\big(I_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(I_k^{\top}W_j+b_j\big)}\right),$$
wherein $I_k^{\top}$ represents a transposed image feature in the shared subspace, $W$ represents a classifier, $W\in\mathbb{R}^{d_l\times C}$, $d_l$ represents a feature dimension of the shared subspace, $C$ represents a quantity of identity (ID) information classes of the person, $y_k$ represents ID information of $I_k$, $b$ represents a bias vector, $W_j$ represents a classification vector of the $j$-th class, $b_j$ represents a bias value of the $j$-th class, $W_{y_k}$ represents a corresponding classification vector of the $y_k$-th class, and $b_{y_k}$ represents a bias value of the $y_k$-th class; and calculating a classification loss $L_{cls}(T_k)$ of the text description feature $T_k$ of the person in the shared subspace according to a formula
$$L_{cls}(T_k)=-\log\!\left(\frac{\exp\big(T_k^{\top}W_{y_k}+b_{y_k}\big)}{\sum_{j=1}^{C}\exp\big(T_k^{\top}W_j+b_j\big)}\right),$$
wherein $T_k^{\top}$ represents a transposed text feature in the shared subspace; and
c-4) calculating a loss function $L_{latent}(I,T)$ of the shared subspace according to a formula
$$L_{latent}(I,T)=\frac{1}{n}L_{trip}(I,T)+\frac{1}{n}\sum_{k}\big(L_{cls}(I_k)+L_{cls}(T_k)\big),$$
wherein $n$ represents a quantity of samples in one batch.
[6]
A method for re-identifying persons across multiple modalities based on bipartite attribute information according to claim 5, wherein the step d) comprises the following steps: d-1) calculating a loss function Lora (1,7) according to a formula 1 2 Lorat (1, 1) = 4] 2 IC, u Cy I. 7, v| | | , wherein the image characteristic / is built up from Zx, where the text description feature 7' of the person is built up from J, where dimensions of and - represent and where a Frobenius norm is represented; t =sigmoid(CxU, + FxU,) d-2) calculating, according to a formula 5 : , weightings of the attribute attribute and the image or text attribute during attribute fusion, where C' represents an attribute attribute to be fused, where 1” | CU, U, represents an image or text feature to be fused, where & and are projection matrices, where I represents a weighting during feature fusion obtained by summing projection results and processing a
25. sxda
U ER result obtained using a sigmoid function, ‚ where daxda U, eR * represents a projection matrix, where S represents a quantity of image attribute classes or text attribute classes and where da represents an attribute size of the attribute space; and d-3) calculating a fused feature A according to a formula fx _ | xd A=ts|CxW‚| +(=0)x|FxW | CW, eR ‚ where < and daxda W, € R Co : represents a projection matrix.
[7]
7. The cross-modality person re-identification method based on dual-attribute information according to claim 1, wherein the step e) comprises the following steps:
e-1) calculating a triplet loss $L_{atrip}(I,T)$ of the attribute space according to a formula
$$L_{atrip}(I,T)=\sum_{I_k^{s}\in I^{s}}\max\big(\rho_2+S_a(I_k^{s},T_k^{sn})-S_a(I_k^{s},T_k^{sp}),\,0\big)+\sum_{T_k^{s}\in T^{s}}\max\big(\rho_2+S_a(T_k^{s},I_k^{sn})-S_a(T_k^{s},I_k^{sp}),\,0\big),$$
wherein $\rho_2$ represents a boundary of the triplet loss, $S_a(\cdot,\cdot)$ represents cosine similarity calculation, $I_k^{s}$ represents a feature of the $k$-th image in the attribute space, $I_k^{s}$ is used as an anchor, $T_k^{sn}$ represents a feature, closest to the anchor $I_k^{s}$, of the heterogeneous text sample, $T_k^{sp}$ represents a feature, farthest from the anchor $I_k^{s}$, of the congeneric text sample, $T_k^{s}$ represents a feature of the $k$-th text description of the person in the attribute space, $T_k^{s}$ is used as an anchor, $I_k^{sn}$ represents a feature, closest to the anchor $T_k^{s}$, of the heterogeneous image sample, and $I_k^{sp}$ represents a feature, farthest from the anchor $T_k^{s}$, of the congeneric image sample;
e-2) calculating a cosine similarity between $a_{I_k}$ and $a_{T_k}$ according to a formula
$$S_a(I_k,T_k)=\frac{a_{I_k}^{\top}a_{T_k}}{\lVert a_{I_k}\rVert\,\lVert a_{T_k}\rVert},$$
wherein $a_{I_k}$ and $a_{T_k}$ respectively represent an image feature with semantic information and a text feature with semantic information that are obtained after attribute information fusion in the attribute space; and
e-3) calculating a loss function $L_{attr}(I,T)$ of the attribute space according to a formula
$$L_{attr}(I,T)=\frac{L_{atrip}(I,T)+L_{coral}(I,T)}{n}.$$
[8]
8. The cross-modality person re-identification method based on dual-attribute information according to claim 1, wherein the step f) comprises the following steps:
f-1) calculating a loss function $L(I,T)$ of a dual-attribute network according to a formula
$$L(I,T)=L_{latent}(I,T)+L_{attr}(I,T);$$
f-2) calculating a similarity $A(I_k,T_k)$ between dual attributes according to a formula
$$A(I_k,T_k)=A_l(I_k,T_k)+A_a(a_{I_k},a_{T_k}),$$
wherein $A_l$ represents a calculated similarity between the features $I_k$ and $T_k$ learned from the shared subspace, and $A_a$ represents a calculated similarity between the features $a_{I_k}$ and $a_{T_k}$ learned from the attribute space; and
f-3) calculating cross-modality matching accuracy based on the similarity $A(I_k,T_k)$.
Similar technologies:
Publication No. | Publication date | Title
Srihari1995|Automatic indexing and content-based retrieval of captioned images
CN104063683B|2017-05-17|Expression input method and device based on face identification
US9430719B2|2016-08-30|System and method for providing objectified image renderings using recognition information from images
Srihari et al.2000|Intelligent indexing and semantic retrieval of multimodal documents
US8897505B2|2014-11-25|System and method for enabling the use of captured images through recognition
US7809192B2|2010-10-05|System and method for recognizing objects from images and identifying relevancy amongst images and information
US7809722B2|2010-10-05|System and method for enabling search and retrieval from image files based on recognized information
Iyengar et al.1997|Models for automatic classification of video sequences
US20090289942A1|2009-11-26|Image learning, automatic annotation, retrieval method, and device
EP2005362A1|2008-12-24|Identifying unique objects in multiple image collections
EP2245580A1|2010-11-03|Discovering social relationships from personal photo collections
NL2028092A|2021-07-28|Cross-modality person re-identification method based on dual-attribute information
Li et al.2019|Learning to learn relation for important people detection in still images
CN112347223A|2021-02-09|Document retrieval method, document retrieval equipment and computer-readable storage medium
CN111046732A|2020-04-21|Pedestrian re-identification method based on multi-granularity semantic analysis and storage medium
CN109165563B|2021-03-23|Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product
CN113177612A|2021-07-27|Agricultural pest image identification method based on CNN few samples
WO2006122164A2|2006-11-16|System and method for enabling the use of captured images through recognition
CN113158777A|2021-07-23|Quality scoring method, quality scoring model training method and related device
Sahbi et al.2000|From coarse to fine skin and face detection
CN107633362B|2020-11-20|Method and system for expressing connection mode between enterprise elements based on biological characteristics
CN111026842A|2020-04-17|Natural language processing method, natural language processing device and intelligent question-answering system
Zhang et al.2018|A Novel Approach for Annotation-based Image Retrieval Using Deep Architecture.
CN112801054B|2021-06-22|Face recognition model processing method, face recognition method and device
Jagtap2012|An improved processing technique with image mining method for classification of textual images using low-level image features
Patent family:
Publication No. | Publication date
CN112001279A|2020-11-27|
CN112001279B|2022-02-01|
Cited references:
Publication No. | Filing date | Publication date | Applicant | Title

US9400925B2|2013-11-15|2016-07-26|Facebook, Inc.|Pose-aligned networks for deep attribute modeling|
GB201703602D0|2017-03-07|2017-04-19|Selerio Ltd|Multi-Modal image search|
CN107562812B|2017-08-11|2021-01-15|北京大学|Cross-modal similarity learning method based on specific modal semantic space modeling|
CN109344266B|2018-06-29|2021-08-06|北京大学深圳研究生院|Dual-semantic-space-based antagonistic cross-media retrieval method|
US11138469B2|2019-01-15|2021-10-05|Naver Corporation|Training and using a convolutional neural network for person re-identification|
CN109829430B|2019-01-31|2021-02-19|中科人工智能创新技术研究院有限公司|Cross-modal pedestrian re-identification method and system based on heterogeneous hierarchical attention mechanism|
CN110021051B|2019-04-01|2020-12-15|浙江大学|Human image generation method based on generation of confrontation network through text guidance|
CN110321813A|2019-06-18|2019-10-11|南京信息工程大学|Cross-domain pedestrian recognition methods again based on pedestrian's segmentation|
CN110909605A|2019-10-24|2020-03-24|西北工业大学|Cross-modal pedestrian re-identification method based on contrast correlation|
CN113627151B|2021-10-14|2022-02-22|北京中科闻歌科技股份有限公司|Cross-modal data matching method, device, equipment and medium|
Legal status:
Priority:
Application No. | Filing date | Title
CN202010805183.XA|CN112001279B|2020-08-12|2020-08-12|Cross-modal pedestrian re-identification method based on dual attribute information|